Chapter 9 - New Developments: Topic Modeling with BERTopic!#
2022 July 30

What is BERTopic?#
As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”
Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”
For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” share almost no tokens, yet express a very similar sentiment.
If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.
This is where BERTopic comes in! BERTopic is a cutting-edge technique that combines transformer-based embeddings (like those underlying BERT) with other ML tools to provide a flexible and powerful topic modeling module (with great visualization support as well!).
In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
Required installs:#
# Installs the base bertopic module:
# !pip install bertopic
# If you want to use other transformers/language backends, it may require additional installs:
# !pip install bertopic[flair] # can substitute 'flair' with 'gensim', 'spacy', 'use'
# bertopic also comes with its own handy visualization suite:
# !pip install bertopic[visualization]
Data sourcing#
For this exercise, we’re going to use a popular dataset, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through Scikit-Learn:
import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
print(documents[0]) # Any ice hockey fans?
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
Creating a BERTopic model:#
Using the BERTopic module requires you to fetch an instance of the model. When doing so, you can specify multiple different parameters including:
language -> the language of your documents
min_topic_size -> the minimum size of a topic; increasing this value will lead to a lower number of topics
embedding_model -> what model you want to use to conduct your word embeddings; many are supported!
For a full list of the parameters and their significance, please see https://github.com/MaartenGr/BERTopic/blob/master/bertopic/_bertopic.py.
Of course, you can always use the default parameter values and instantiate your model as
model = BERTopic(). Once you’ve done so, you’re ready to fit your model to your documents!
Example instantiation:#
from sklearn.feature_extraction.text import CountVectorizer
# example parameter: a custom vectorizer model can be used to remove stopwords from the documents:
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
# instantiating the model:
model = BERTopic(vectorizer_model=stopwords_vectorizer)
Fitting the model:#
The first step of topic modeling is to fit the model to the documents:
topics, probs = model.fit_transform(documents)
.fit_transform() returns two outputs:
topics contains mappings of inputs (documents) to their modeled topic (alternatively, cluster)
probs contains a list of probabilities that an input belongs to its assigned topic
Note:
fit_transform() can be substituted with fit(). fit_transform() allows for the prediction of new documents but demands additional computing power/time.
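To make the difference concrete, here is a minimal sketch of assigning topics to unseen documents with .transform(). The new documents below are invented for illustration, and the call itself is commented out because it requires the fitted model from above:

```python
# Hypothetical unseen documents (not part of 20 Newsgroups):
new_documents = [
    "The goalie made an incredible glove save in overtime.",
    "A new encryption chip raised serious privacy concerns.",
]

# Requires the fitted `model` from the cell above; uncomment to run:
# new_topics, new_probs = model.transform(new_documents)
# `new_topics` holds one topic number per new document.
```

A model trained only with .fit() cannot do this, which is the trade-off the note above describes.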
Viewing topic modeling results:#
The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:
# view your topics:
topics_info = model.get_topic_info()
# get detailed information about the top five most common topics:
print(topics_info.head(5))
Topic Count Name
0 -1 6269 -1_file_use_program_information
1 0 1820 0_game_games_players_season
2 1 576 1_clipper_chip_encryption_nsa
3 2 525 2_ken huh_ites yep_huh art_cheek ken
4 3 461 3_israel_israeli_jews_palestinian
When examining topic information, you may see a topic with the assigned number ‘-1.’ Topic -1 collects all outlier documents that could not be confidently assigned to a topic, and it should typically be ignored during analysis.
Forcing those documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to discard inputs into this ‘Topic -1’ bin.
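If you do want to set outliers aside before downstream analysis, you can filter on the topic number directly. This is a minimal sketch that uses made-up topic assignments in place of the real fit_transform() output:

```python
# Made-up stand-ins for the (documents, topics) produced by fit_transform():
docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e", "doc_f"]
doc_topics = [0, -1, 2, 0, -1, 1]

# Keep only documents assigned to a real topic (drop the Topic -1 outlier bin):
assigned = [(doc, t) for doc, t in zip(docs, doc_topics) if t != -1]
outlier_share = doc_topics.count(-1) / len(doc_topics)
print(f"{len(assigned)} assigned documents; {outlier_share:.0%} outliers")
# 4 assigned documents; 33% outliers
```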
# access a single topic:
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('game', 0.008897079120040405), ('games', 0.006239805375203304), ('players', 0.005500447313911373), ('season', 0.0054190392544866215), ('hockey', 0.0053334000238632676), ('league', 0.004355709419398449), ('teams', 0.0040580511632626205), ('baseball', 0.0038206398834612124), ('nhl', 0.0035702345600771056), ('gm', 0.003050818663267911)]
# get representative documents for a specific topic:
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics
["On two separate occasions I saw Dick Allen (back when he was Richie)\nhomer at Shea off the middle of the black centerfield hitter's\nbackground screen. I think both shots would have traveled 500 feet.", '\nDale Hawerchuk and Troy Murray were both captains of the Jets\nwhen they were traded. (Murray this year in mid-season, Hawerchuk\na few years ago in the off-season.)', 'Disclaimer -- This is for fun.\n\nIn my computerized baseball game, I keep track of a category called\n"stolen hits", defined as a play made that "an average fielder would not\nmake with average effort." Using the 1992 Defensive Averages posted\nby Sherri Nichols (Thanks Sherri!), I\'ve figured out some defensive stats\nfor the centerfielders. Hits Stolen have been redefined as "Plays Juan\nGonzalez would not have made."\n\nOK, I realize that\'s unfair. Juan\'s probably the victim of pitching staff,\nfluke shots, and a monster park factor. But let\'s put it this way: If we\nreplaced every centerfielder in the league with someone with Kevin\'s 55.4% out\nmaking ability, how many extra hits would go by?\n\nTo try and correlate it to reality a little more, I\'ve calculated Net\nHits Stolen, based on the number of outs made compared to what a league\naverage fielder would make. By the same method I\'ve calculated Net Extra \nBases (doubles and triples let by).\n\nFinally, I throw all this into a a formula I call Defensive Contribution, or\nDCON :->. Basically, it represents the defensive contribution of a player.\nI add this number to OPS to get DOPS (Defense + Onbase Plus Slug), which\nshould represent the player\'s total contribution to the team. So don\'t\ntake it too seriously. 
The formula for DCON appears at the end of this\narticle.\n\nThe short version -- definition of terms\nHS -- Hits Stolen -- Extra outs compared to Kurt Stillwell\nNHS -- Net Hits Stolen -- Extra outs compared to average fielder\nNDP -- Net Double Plays -- Extra double plays turned compared to avg fielder\nNEB -- Net Extra Bases -- Extra bases prevented compared to avg. fielder\nDCON -- Defensive Contribution -- bases and hits prevented, as a rate.\nDOPS -- DCON + OPS -- quick & dirty measure of player\'s total contribution.\n\nNational League\n\nName HS NHS NEB DCON DOPS\nNixon, O. 34 12 15 .083 .777\nGrissom, M. 48 18 12 .072 .812\nJackson, D. 46 13 20 .060 .735\nLewis, D. 25 8 -6 .029 .596\nDykstra, L. 25 5 -5 .013 .794\nDascenzo, D. 10 -5 10 .001 .616\nFinley, S. 32 -2 2 -.003 .759\nLankford, R. 39 4 -12 -.007 .844\nMartinez, D. 21 5 -16 -.017 .660\nVanSlyke, A. 30 -4 -17 -.040 .846\nSanders, R. 7 -10 -4 -.059 .759\nButler, B. 1 -29 5 -.088 .716\nJohnson, H. 3 -12 -19 -.118 .548\n\nOrdered by DOPS\n\n.846 VanSlyke\n.844 Lankford\n.812 Grissom\n.794 Dykstra\n.777 Nixon\n.759 Finley\n.759 Sanders\n.735 Jackson\n.730 *NL Average*\n.716 Butler\n.660 Martinez\n.616 Dascenzo\n.596 Lewis\n.548 Johnson\n\nAmerican League\n---------------\n\nName HS NHS NEB DCON DOPS\nLofton, K. 57 32 17 .220 .947\nWilson, W. 47 26 0 .125 .787\nWhite, D. 52 25 28 .119 .812\nFelix, J. 22 0 32 .063 .713\nDevereaux, M. 43 16 0 .047 .832\nMcRae, H. 38 11 -1 .038 .631\nYount, R. 31 8 -3 .022 .737\nKelly, R. 13 -6 -3 -.025 .681\nJohnson, L. 23 -5 -13 -.040 .641\nGriffey, K. 15 -9 -12 -.052 .844\nPuckett, K. 13 -13 -15 -.063 .801\nCuyler, M. 6 -10 -6 -.088 .503\nGonzalez, J. 
0 -21 -15 -.095 .738\n\n\nOrder by DOPS\n\n.947 Lofton\n.844 Griffey\n.832 Devereaux\n.812 White\n.801 Puckett\n.787 Wilson\n.738 Gonzalez\n.737 Yount\n.713 Felix\n.709 *AL Average*\n.681 Kelly\n.641 Johnson\n.631 McRae\n.503 Cuyler\n\nMore discussion --\n\nDCON formula: ((NHS + NDP)/PA) + ((NHS + NDP + NEB)/AB)\nWhy such a bizzare formula? Basically, it\'s designed to be added into the\nOPS, with the idea that "a run prevented is as important as a run scored".\nThe extra outs are factored into OBP, while the extra bases removed are \nfactored into SLG. That\'s why I used PA and AB as the divisors.\n\nFor more discussion see the post on Hits Stolen -- First Base 1992\n-- \nDale J. Stephenson |*| (steph@cs.uiuc.edu) |*| Baseball fanatic']
# find topics similar to a key term/phrase:
topics, similarity_scores = model.find_topics("sports", top_n=5)
print("Most common topics:" + str(topics)) # view the numbers of the top-5 most similar topics

# print the initial contents of the most similar topics
for topic_num in topics:
    print('\nContents from topic number: ' + str(topic_num) + '\n')
    print(model.get_topic(topic_num))
Most common topics:[0, 30, 100, 5, 119]
Contents from topic number: 0
[('game', 0.008897079120040405), ('games', 0.006239805375203304), ('players', 0.005500447313911373), ('season', 0.0054190392544866215), ('hockey', 0.0053334000238632676), ('league', 0.004355709419398449), ('teams', 0.0040580511632626205), ('baseball', 0.0038206398834612124), ('nhl', 0.0035702345600771056), ('gm', 0.003050818663267911)]
Contents from topic number: 30
[('games', 0.0311986186334776), ('joystick', 0.024854772600992288), ('sega', 0.01843417508718104), ('arcade', 0.01152978960878613), ('snes', 0.010316326486052874), ('joysticks', 0.009755975822819033), ('games sale', 0.009552427913817675), ('sale', 0.00920289688446232), ('sega genesis', 0.008385865867941949), ('sell', 0.0065423678170002195)]
Contents from topic number: 100
[('bike', 0.024284182184133193), ('motorcycle', 0.020949204271212656), ('riding', 0.018080412148776013), ('steering', 0.015196397328246822), ('wheels', 0.011972626079217374), ('riders', 0.011099492579328645), ('gyroscopes', 0.008942235342491037), ('like motorcycle', 0.008942235342491037), ('wheel', 0.008761111668299194), ('traction', 0.008655336102834211)]
Contents from topic number: 5
[('health', 0.007414055207369821), ('cancer', 0.006066041048751265), ('disease', 0.005210870415718458), ('tobacco', 0.005131395661303987), ('medical', 0.005008398841824522), ('hiv', 0.004771926193192804), ('malaria', 0.004156690066707363), ('smokeless tobacco', 0.004077331372228564), ('lyme', 0.003968454429383943), ('medical newsletter', 0.003951881410752244)]
Contents from topic number: 119
[('helmet', 0.13098497242062246), ('cb', 0.02784057359372715), ('helmets', 0.019544646744958757), ('leave helmet', 0.015331961163779528), ('helmet mirror', 0.012603026097472334), ('helmet seat', 0.012603026097472334), ('foam liner', 0.009778563039023093), ('place helmet', 0.009778563039023093), ('weight helmet', 0.009778563039023093), ('fit', 0.009187518590002912)]
Saving/loading models:#
One of the most obvious drawbacks of using the BERTopic technique is the algorithm’s run-time. But, rather than re-running a script every time you want to conduct topic modeling analysis, you can simply save/load models!
# save your model:
# model.save("TAML_ex_model")
# load it later:
# loaded_model = BERTopic.load("TAML_ex_model")
Visualizing topics:#
Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.
Depending on the visualization, it can even reveal patterns that would be much harder or impossible to see through textual analysis - like inter-topic distance!
Let’s see some examples!
# Create a 2D representation of your modeled topics & their pairwise distances:
model.visualize_topics()
# Get the words and probabilities of top topics, but in bar chart form!
model.visualize_barchart()
# Evaluate topic similarity through a heat map:
model.visualize_heatmap()
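The visualize_* methods return Plotly figures, so each one can also be written to a standalone interactive HTML file for sharing outside a notebook. The filename below is just an example, and the calls are commented out because they require the fitted model:

```python
# Hypothetical output path for the saved figure:
output_path = "intertopic_distance.html"

# Requires the fitted `model` from above; uncomment to run:
# fig = model.visualize_topics()  # returns a Plotly figure
# fig.write_html(output_path)     # saves a standalone interactive HTML file
```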
Conclusion#
Hopefully you’re convinced of how accessible yet powerful a technique BERTopic topic modeling can be! There’s plenty more to learn about BERTopic than what we’ve covered here, but you should be ready to get started!
During your adventures, you may find the following resources useful:
Original BERTopic Github: https://github.com/MaartenGr/BERTopic
BERTopic visualization guide: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-terms
How to use BERT to make a custom topic model: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6
Recommended things to look into next include:
how to select the best embedding model for your BERTopic model;
controlling the number of topics your model generates; and
other visualizations and deciding which ones are best for what kinds of documents.
Questions? Please reach out! Anthony Weng, SSDS consultant, is happy to help (contact: ad2weng@stanford.edu)
Exercise#
Repeat the steps in this notebook with your own data. However, real data does not come with a fetch function. What importation steps do you need to consider so your own corpus works?
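As a starting point for the exercise, here is one minimal sketch of such an importation step, reading every .txt file in a folder into the list-of-strings format that fit_transform() expects (the folder name is hypothetical):

```python
from pathlib import Path

def load_corpus(folder):
    """Read each .txt file in `folder` into one document string."""
    paths = sorted(Path(folder).glob("*.txt"))
    return [p.read_text(encoding="utf-8") for p in paths]

# my_documents = load_corpus("my_corpus/")
# topics, probs = model.fit_transform(my_documents)
```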